Neural Video Compression with Temporal Layer-Adaptive Hierarchical B-frame Coding
Neural video compression (NVC) is a rapidly evolving video coding research
area, with some models achieving superior coding efficiency compared to the
latest video coding standard, Versatile Video Coding (VVC). In conventional
video coding standards, hierarchical B-frame coding, which uses a
bidirectional prediction structure for higher compression, has been
well studied and exploited. In NVC, however, limited research has investigated
the hierarchical B-frame scheme. In this paper, we propose an NVC model exploiting
hierarchical B-frame coding with temporal layer-adaptive optimization. We first
extend an existing unidirectional NVC model to a bidirectional model, which
achieves a -21.13% BD-rate gain over the unidirectional baseline model. However,
this model faces challenges when applied to sequences with complex or large
motions, leading to performance degradation. To address this, we introduce
temporal layer-adaptive optimization, incorporating methods such as temporal
layer-adaptive quality scaling (TAQS) and temporal layer-adaptive latent
scaling (TALS). The final model with the proposed methods achieves an
impressive BD-rate gain of -39.86% against the baseline. It also resolves the
challenges in sequences with large or complex motions, yielding up to -49.13%
additional BD-rate gain over the simple bidirectional extension. This improvement is
attributed to allocating more bits to lower temporal layers, which enhances
overall reconstruction quality at a smaller total bit cost. Since our method
has little dependency on a specific NVC model architecture, it can serve as a
general tool for extending unidirectional NVC models to bidirectional ones with
hierarchical B-frame coding.
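To make the hierarchical B-frame coding structure concrete, here is a minimal sketch (not the authors' code) that enumerates the coding order and temporal layer of each frame in one GOP and attaches an illustrative per-layer quality weight. The recursive mid-point splitting is the standard hierarchical B pattern; the decaying weight is only a hypothetical stand-in for temporal layer-adaptive scaling such as TAQS/TALS, whose exact formulation is not given in the abstract.

```python
# Sketch: hierarchical B-frame coding order within one GOP, plus a
# hypothetical temporal layer-adaptive quality weight (NOT the paper's
# exact TAQS/TALS formulation).

def hierarchical_b_order(start, end, layer=1):
    """Yield (frame_index, temporal_layer) for the B-frames strictly between
    two already-coded reference frames `start` and `end`."""
    if end - start < 2:
        return
    mid = (start + end) // 2              # bidirectionally predicted B-frame
    yield mid, layer
    yield from hierarchical_b_order(start, mid, layer + 1)
    yield from hierarchical_b_order(mid, end, layer + 1)

def layer_quality_weight(layer, base=1.0, decay=0.7):
    """Illustrative scaling: lower temporal layers, which are referenced by
    more frames, receive larger weights and hence more bits."""
    return base * decay ** (layer - 1)

if __name__ == "__main__":
    gop_size = 8                          # frames 0 and 8 are anchor frames
    for idx, layer in hierarchical_b_order(0, gop_size):
        print(f"frame {idx}: layer {layer}, weight {layer_quality_weight(layer):.2f}")
```

On a GOP of 8 this prints the usual coding order 4, 2, 1, 3, 6, 5, 7 with temporal layers 1-3, which illustrates why spending more bits on lower layers helps: every higher-layer frame is predicted from them.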
End-to-End Learnable Multi-Scale Feature Compression for VCM
The proliferation of deep learning-based machine vision applications has
given rise to a new type of compression, the so-called video coding for machines
(VCM). VCM differs from traditional video coding in that it is optimized for
machine vision performance instead of human visual quality. In the feature
compression track of MPEG-VCM, multi-scale features extracted from images are
subject to compression. Recent feature compression works have demonstrated that
the Versatile Video Coding (VVC) standard-based approach can achieve a BD-rate
reduction of up to 96% against the MPEG-VCM feature anchor. However, it is still
sub-optimal as VVC was not designed for extracted features but for natural
images. Moreover, the high encoding complexity of VVC makes it difficult to
design a lightweight encoder without sacrificing performance. To address these
challenges, we propose a novel multi-scale feature compression method that
enables both end-to-end optimization of the extracted features and the
design of lightweight encoders. The proposed model combines a learnable
compressor with a multi-scale feature fusion network so that the redundancy in
the multi-scale features is effectively removed. Instead of simply cascading
the fusion network and the compression network, we integrate the fusion and
encoding processes in an interleaved way. Our model first encodes a
larger-scale feature to obtain a latent representation and then fuses the
latent with a smaller-scale feature. This process is successively performed
until the smallest-scale feature is fused and then the encoded latent at the
final stage is entropy-coded for transmission. The results show that our model
outperforms previous approaches by at least 52% BD-rate reduction and requires
substantially less encoding time for object detection. It is also noteworthy that
our model can attain near-lossless task performance with only 0.002-0.003% of the
uncompressed feature data size.
Comment: Under peer review for IEEE TCSVT
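The interleaved encode-then-fuse process described above can be sketched in a few lines of PyTorch. The module below encodes the largest-scale feature, fuses the resulting latent with the next smaller-scale feature, and repeats until the smallest scale is absorbed; only the final latent would then be quantized and entropy-coded. The channel sizes, strided-convolution encoder, and 1x1-convolution fusion are placeholders chosen for the sketch, not the released architecture.

```python
# Sketch of the interleaved encode-then-fuse idea (assumed layer choices,
# not the paper's exact model).

import torch
import torch.nn as nn

class EncodeFuseStage(nn.Module):
    def __init__(self, in_ch, feat_ch, out_ch):
        super().__init__()
        # stride-2 conv stands in for one encoding step (downsamples by 2)
        self.encode = nn.Conv2d(in_ch, out_ch, kernel_size=5, stride=2, padding=2)
        # 1x1 conv fuses the encoded latent with the smaller-scale feature
        self.fuse = nn.Conv2d(out_ch + feat_ch, out_ch, kernel_size=1)

    def forward(self, x, smaller_feat):
        latent = self.encode(x)                       # encode larger-scale input
        fused = torch.cat([latent, smaller_feat], 1)  # spatial sizes now match
        return self.fuse(fused)

class InterleavedMSFC(nn.Module):
    """Encode the largest-scale feature, then alternately fuse and encode
    down the pyramid until the smallest scale is absorbed."""
    def __init__(self, feat_ch=256, latent_ch=128, num_scales=3):
        super().__init__()
        stages, in_ch = [], feat_ch
        for _ in range(num_scales - 1):
            stages.append(EncodeFuseStage(in_ch, feat_ch, latent_ch))
            in_ch = latent_ch
        self.stages = nn.ModuleList(stages)

    def forward(self, feats):
        # feats: list ordered from largest to smallest spatial scale,
        # each scale half the resolution of the previous one
        x = feats[0]
        for stage, f in zip(self.stages, feats[1:]):
            x = stage(x, f)
        return x  # final latent, to be quantized and entropy-coded

if __name__ == "__main__":
    feats = [torch.randn(1, 256, 64, 64),
             torch.randn(1, 256, 32, 32),
             torch.randn(1, 256, 16, 16)]
    latent = InterleavedMSFC()(feats)
    print(latent.shape)  # torch.Size([1, 128, 16, 16])
```

The point of interleaving, as opposed to fusing all scales first and then compressing, is that each encoding step already reduces the larger scale to the resolution of the next feature, so fusion and redundancy removal happen jointly at every stage.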